Abstract

Amino acid repeats (AARs) are abundant in protein sequences. They have particular roles in protein function and
evolution. Simple repeat patterns generated by DNA slippage tend to introduce length variations and point mutations
in repeat regions. Loss of normal function and gain of abnormal function owing to their variable length are potential
risks leading to disease. Repeats with complex patterns mostly refer to functional domain repeats, such as
the well-known leucine-rich repeat and WD repeat, which are frequently involved in protein-protein interaction.
They are mainly derived from internal gene duplication events and stabilized by 'gate-keeper' residues, which play
crucial roles in preventing inter-domain aggregation. AARs are widely distributed in different proteomes across a
variety of taxonomic ranges, and are especially abundant in eukaryotic proteins. However, their specific evolutionary
and functional scenarios are still poorly understood. Identifying AARs in protein sequences is the first step towards
further investigation of their biological function and evolutionary mechanism. In principle, this is an NP-hard problem,
as most repeat fragments are shaped by a series of sophisticated evolutionary events and become
latent periodic patterns. It is not possible to define a uniform criterion for detecting and verifying the various repeat
patterns. Instead, different algorithms based on different strategies have been developed to cope with different
repeat patterns. In this review, we describe the amino acid repeat-detection algorithms currently available
and compare their strategies based on an in-depth analysis of the biological significance of protein repeats.

Keywords: amino acid repeat; detection algorithm; low complexity sequence; repeat containing protein; protein domain 
repeats 



INTRODUCTION 

Amino acid repeats (AARs) are abundant in protein
sequences, either as periodic elements in structural
proteins such as collagens, keratins, silk and cell
wall proteins, or as structural modules in functional
proteins such as transcription factors, receptors, ion
channels, histones, ubiquitins and calcium storage
proteins. Table 1 shows some well-known examples
of human repeat-containing proteins (RCPs) gathered
in the UniProt/Swiss-Prot Knowledgebase
(http://www.uniprot.org/). For example, the
major prion protein (PRIO_HUMAN) contains an
N-terminal repeat region with several octamers
(PHGGGWGQ); the extra-embryonic spermatogenesis
homeobox 1 protein (ESX1_HUMAN) has
a sequence motif PPxxPxPPx repeated nine times;
and the alpha-1 type I collagen protein contains
repeats of various lengths of the periodic tri-amino
acid unit GPP. The giant muscle protein Titin
(TITIN_HUMAN), composed of 34350 amino acid
residues, contains several types of repeating domains.
Single amino acid repeats (SAARs) are also common,
such as the polyQ repeats in the Forkhead box protein
P2 (FOXP2_HUMAN), the androgen receptor
(ANDR_HUMAN) and the Huntington's disease
(HD) protein (HD_HUMAN). Other SAARs,
including polyL, polyA and polyH, can also be
found in many other proteins. RCPs are distributed
in all kingdoms of life, and are especially abundant
in eukaryotes [1].
Table 1: Some examples of AARs in human proteins

UniProt ID    Protein                                       AA      Repeat pattern
SECR_HUMAN    Secretin                                      121     polyL
PRIO_HUMAN    Major prion protein                           253     (PHGGGWGQ)4
ANKR1_HUMAN   Ankyrin repeat domain-containing protein 1    319     Ankyrin repeat
CASQ2_HUMAN   Calsequestrin-2                               399     D/E-Rich
ESX1_HUMAN    Homeobox protein ESX1                         406     (PPxxPxPPx)9
WDR1_HUMAN    WD repeat-containing protein 1                606     WD repeat
UBC_HUMAN     Polyubiquitin-C                               685     Ubiquitin
FOXP2_HUMAN   Forkhead box protein P2                       715     polyQ
LRRN1_HUMAN   Leucine-rich repeat neuronal protein 1        716     Leucine Rich Repeat
ANDR_HUMAN    Androgen receptor                             919     polyQ, polyG, polyP
SRBP2_HUMAN   Sterol regulatory element-binding protein 2   1141    polyS, (PQ)4, (SGSS)2
BRD4_HUMAN    Bromodomain-containing protein 4              1362    polyP, polyH, polyQ, K-Rich, S-Rich
CO1A1_HUMAN   Collagen alpha-1(I) chain                     1464    (GPP)n
CAC1A_HUMAN   Brain calcium channel 1                       2505    polyQ, polyH, polyG
HD_HUMAN      Huntington disease protein                    3142    polyQ, polyP, polyT, polyE, HEAT domain
MLL2_HUMAN    Histone-lysine N-methyltransferase MLL2       5537    (S/P-P-P-E/P-E/A)15
TITIN_HUMAN   Titin                                         34350   Several types of repeating domains:
                                                                    TPR, WD, RCC1, PEVK, Kelch, Z, Ig repeats



It is known that some AARs such as the leucine-rich
repeats (LRRs) form the structural framework
for protein-protein interaction, and the repeat fragment
in zinc finger transcription factors binds to
cis-elements of DNA promoters. AARs can also cause
problems such as the mis-folding of prion proteins
[2]. Furthermore, modification of repeat length may
introduce abnormal function. A typical case is the
expansion of polyQ, resulting in several neurological
disorders such as mental retardation, HD, inherited
ataxias and muscular dystrophy.

Classification of amino acid repeat 
patterns at sequence level 

Mathematical and statistical methodologies can be 
applied to study the particular functional and evolu- 
tionary background of an AAR. Several approaches 
have been proposed to classify AARs into different 
categories depending on the characteristics of repeat 
units, including the sequence similarity among repeat 
units, the distance between adjacent repeat units and 
the complexity of the sequence pattern of the repeat 
units. 

The first approach is to classify AARs according
to the similarity among the repeat units. Based on
this approach, AARs can be classified into two main
groups: perfect repeats and imperfect repeats.
The repeat units in perfect repeat fragments are identical,
e.g. AAAAAAA and PQPQPQPQ, whereas
the repeat units in imperfect repeat fragments are
not exactly the same, e.g. AAWAAAA and
QQQMLQQQFL. Imperfect repeats with highly
variable, but still recognizable, repeat units are also
called divergent repeats.
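The distinction between perfect and imperfect repeats can be made concrete with a small sketch. The function name `classify_repeat` and the convention that a fragment is supplied together with its unit length are assumptions made for illustration only; they are not taken from any of the tools reviewed here.

```python
def classify_repeat(fragment: str, unit_len: int) -> str:
    """Label a fragment as a 'perfect' or 'imperfect' repeat.

    The first unit_len residues are taken as the reference unit, and
    every position of the fragment is compared against it cyclically,
    so a trailing partial unit is checked as well.
    """
    if unit_len <= 0 or len(fragment) < 2 * unit_len:
        raise ValueError("need at least two full units")
    reference = fragment[:unit_len]
    for i, residue in enumerate(fragment):
        if residue != reference[i % unit_len]:
            return "imperfect"
    return "perfect"


print(classify_repeat("AAAAAAA", 1))   # perfect
print(classify_repeat("PQPQPQPQ", 2))  # perfect
print(classify_repeat("AAWAAAA", 1))   # imperfect
```

A single mismatch is enough to move a fragment into the imperfect class here; deciding whether an imperfect repeat is still "recognizable" requires a similarity measure rather than an exact comparison.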

The second approach for repeat classification is
based on the distance between adjacent units. AARs
can be classified as tandem repeats (TRs) or non-tandem
repeats (NTRs). The units in TRs are directly
adjacent to one another in the repeat sequence, whereas
the units in NTRs are separated by intervening sequence.

The third approach takes the complexity of the
sequence pattern of the repeat units into consideration.
Based on this approach, AARs can be roughly
classified as simple repeats or complex repeats. Simple
repeats generally refer to continuous or interrupted
runs of single amino acid residues or short
peptides. The regions in a protein sequence containing
simple repeats are often called simple sequences
(SSs) or low complexity regions (LCRs). Most
complex repeats, on the other hand, have
sophisticated patterns of repeat units with variable
lengths ranging from 10 to >100 residues, and
these complex repeat patterns are frequently recognized
as repeated protein domains [3].

In practice, it is rather difficult to strictly distinguish 
the different classes owing to the complicated patterns 
of AARs. For example, some domain repeats also con- 
tain SSs, such as the abundant leucine residues found 
in an LRR domain. And in the case of point mutations 
or insertions/deletions (INDELs), the original per- 
fectly repeated units in proteins could gradually 
evolve into non-perfect tandem repeats (NPTRs). 

The above approaches used to classify AARs are
all based on the protein sequence. However, they are
insufficient to reveal the biological significance of
AARs, as proteins play their functional roles by folding
into particular secondary and tertiary structures,
which are difficult to deduce through amino acid
patterns at sequence level. Data from several experiments
show that proteins with similar tertiary structures
may share low sequence identity [4, 5]. And
similar functional domains of proteins do not necessarily
correspond to recognizable sequence repeat
patterns [3, 6–8]. Therefore, in-depth study of protein
repeats requires better understanding of the correspondence
of repeat sequences with their structures
and functions. In addition, the acquisition of such
biological knowledge is more sophisticated than
simply classifying sequential repeat data.

Biological significance of different 
patterns of AARs 

Biologically, different amino acid repeat patterns
imply different functional and evolutionary backgrounds.
Repeats with simple patterns, such as
single AARs, mainly exist in intrinsically unstructured
regions (IURs) of proteins [9, 10]. Such
protein regions that do not fold into a 3D structure
commonly have functions related to molecular recognition
and molecular assembly [11, 12]. Single
amino acid repeats encoded by trinucleotide repeats,
such as polyQ, are involved in neurodegenerative
diseases such as HD [13], where their length variations
often result in either loss of normal or gain of
abnormal function [14, 15].

Most SAARs are presumed to be originally 
derived from replicative DNA slippage [16] in the 
coding region. Expansion of some SAARs might also 
result from unequal chromosomal crossover, such as 
the polyA in the human HOX13 gene [17]. In gen- 
eral, perfect amino acid runs are inherently mutable 
and are frequently interrupted by point mutations 
[18] to become simple sequences [19]. 

In addition to SAARs, sequential tandem repeats,
both perfect tandem repeats (PTRs) and NPTRs, with
highly similar units are prevalent in protein sequences.
We have found that ~13% of all proteins deposited in
the public protein databases contain at least one tandem
repeat fragment. More than 40% of the tandem repeats
are PTRs, and ~60% of the PTRs are single amino acid
runs [1]. Errors in sequencing and automatic annotation
procedures might have introduced some
false-positive PTRs into the public protein knowledgebase.
However, this does not undermine the biological
significance of frequently occurring PTRs in
protein sequences, especially considering that
functional PTRs are continuously being identified
experimentally, and most of them are conserved
among orthologous proteins [20–22].

Consistent with this scenario, conservation of
amino acid tandem repeats is a strong indication of
biological relevance. The phylogenetically conserved
repeat fragments among orthologous proteins should
have a conserved function, such as the conserved
polyQ regions in primate FOXP2 proteins [23]. In
contrast, variable repeat unit length in corresponding
regions of orthologous proteins indicates
a different scenario. These repeats are probably going
through a rapid change driven by selection [24].
More interestingly, tandem repeats have been
shown to play an important role in micro-evolution
by catalysing the rapid production of genetic and
phenotypic variation among organisms [25–28].

Repeats with complex patterns, which are generally
called domain repeats, have comparatively stable
structures and conserved functions. Domain
repeats such as LRRs, zinc finger repeats, Ankyrin
repeats and Tetratricopeptide repeats (TPRs) [30]
are among the most common protein motifs
in the Pfam database [29]. These domain repeats are
mostly involved in transcription regulation, cell-cycle
control and signal transduction [31–34] and are
widespread in the proteomes of different species
across different kingdoms of life [35]. Many genes
containing these domain repeats in the coding region are
significant in certain diseases [36], as sequence identity
increases the chance of protein aggregation [37]
and mis-folding. Domain repeats are thought to have
evolved through internal gene duplications arising
from recombination events [3, 38], such as unequal
crossing over [39] and exon shuffling [40]. The
duplications may involve several domains at a time [3,
41]. In addition, a number of specific sequence-based
signals such as the 'gate-keeper' residues [41] play a
crucial role in preventing inter-domain aggregation.
Therefore, these repeat patterns are generally obscure
at sequence level, and a sophisticated search is
required to detect them.



REPEAT DETECTION STRATEGIES 

During the past decade, several strategies for the
identification of AARs from protein sequences
have been reported. Among these approaches, the
three major ones are self-comparison, pattern recognition
and complexity measurement. Table 2 shows
the algorithms and publicly available tools, including
online resources, that can be used to detect AARs of
various types.

Table 2: Repeat detection algorithms

Method                   Repeat type (a)   Ref    Availability

Self-comparison
REP                      Domain            [42]   http://www.embl.de/~andrade/papers/rep/search.html
COACH                    Domain            [43]   http://www.drive5.com/lobster/
TPRpred                  Domain            [44]   http://tprpred.tuebingen.mpg.de/
REPRO                    Domain            [45]   http://www.ibi.vu.nl/programs/reprowww/
TRUST                    Divergent         [46]   http://www.ibi.vu.nl/programs/trustwww/
Internal Repeat Finder   Divergent         [47]   http://nihserver.mbi.ucla.edu/Repeats/
HHrep                    Divergent         [48]   http://hhrep.tuebingen.mpg.de/hhrep/
RADAR                    Divergent         [49]   http://www.ebi.ac.uk/Tools/Radar/
HHrepID                  Divergent         [50]   http://toolkit.tuebingen.mpg.de/hhrepid/

Pattern recognition
REPETITA                 Solenoid          [51]   http://protein.bio.unipd.it/repetita/
LSTM                     Domain            [52]   http://www.bioinf.jku.at/software/LSTM_protein/
ARD                      Alpha-Rod         [53]   http://www.ogic.ca/projects/ard/

Complexity measurement
SIMPLE                   Simple            [19]   http://www.biochem.ucl.ac.uk/bsm/SIMPLE/
GBA                      Simple            [54]   xli@cise.ufl.edu

Others
XSTREAM                  NPTR              [55]   http://jimcooperlab.mcdb.ucsb.edu/xstream/
Apriod                   PPP               [56]   hwan@mindgen.org
LocRepeat                PPP               [57]   http://www.cs.cityu.edu.hk/~lwang/software/LocRepeat/
REPfind                  NPTR              [58]   adebiyi@informatik.uni-tuebingen.de
Reptile                  Perfect           [59]   http://reptile.unibe.ch/
SUFFIX                   Perfect           [60]   http://www.cs.ucdavis.edu/~gusfield/strmat.html

(a) NPTR = non-perfect tandem repeat; PPP = pseudo-periodic partitions.

In the following section, we will give a brief intro- 
duction to the amino acid repeat-detection strategies 
focusing on the general principles behind these 
strategies. 

The self-comparison strategy 

One of the most intuitive strategies to detect repeat
patterns in protein sequences is the self-comparison
method. The idea of this approach is rather simple,
i.e. comparing a protein sequence to itself. Sequence
comparison is a fundamental bioinformatics method
that has been extensively used to search for similar
regions among biological sequences. The global
sequence alignment method was first proposed in
the 1970s [61] and focuses on finding the optimal
alignment of two entire biological sequences using
dynamic programming. Soon after, the Smith-Waterman
local alignment algorithm [62] was
developed to recognize the best-aligned sub-regions
between two sequences, which are more likely to
reflect meaningful biological relevance.

On aligning a sequence with itself for the purpose
of identifying repeat patterns, the sub-optimal
alignments become obscured by the best (and most
obvious) alignment, that of the whole sequence with
itself. This optimal alignment should
be excluded from the initial search. The reliability of
identifying sub-optimal alignments of protein sequences
using the dynamic programming method
has been evaluated [62]. A distinguishing feature
of this method is the use of a scoring system that
gives scores to paired amino acids and penalties to
unmatched gaps. Substitution matrices such as PAM
[63] and BLOSUM [64] are the basis of the scoring
system and represent the specific evolutionary relevance
among different amino acids. More specifically
tuned scoring matrices have also been proposed.
These matrices take special features of amino acids
such as polarity, electrostatic charge, structure, molecular
volume and codon bias [65] into account.
One of the greatest advantages of using a scoring
system for identifying sub-optimal alignments is
that statistical models can be applied to define reliable
criteria [66, 67].
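A minimal sketch of the self-comparison idea follows. It fills a Smith-Waterman matrix for a sequence against itself, but only above the main diagonal, so the trivial optimal alignment is excluded and the best remaining local alignment is reported. The function name `best_self_alignment` and the flat scores (match +2, mismatch -1, linear gap -2) are assumptions for illustration; the tools discussed in this review rely on substitution matrices such as PAM or BLOSUM and on statistical significance tests instead.

```python
def best_self_alignment(seq, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman of seq against itself, ignoring the trivial diagonal.

    Only cells with i < j are filled, so a residue is never aligned
    with itself. Returns (score, end_i, end_j), where end_i and end_j
    are the 1-based end positions of the two aligned copies.
    """
    n = len(seq)
    # H[i][j] = best local alignment score ending at seq[i-1] / seq[j-1];
    # cells on or below the diagonal stay 0 and act as a boundary.
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align the two residues
                          H[i - 1][j] + gap,     # gap in the second copy
                          H[i][j - 1] + gap)     # gap in the first copy
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


# Two copies of the prion octamer: the first copy ends at residue 8,
# the second at residue 16, with eight matches scoring 2 each.
print(best_self_alignment("PHGGGWGQPHGGGWGQ"))  # (16, 8, 16)
```

The full matrix is kept here for clarity, which costs memory proportional to the square of the sequence length; a traceback step would be needed to recover the aligned units themselves.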

In principle, the self-alignment repeat-detection
methods are the extension of an alignment-based
homology-detection approach. Thus, they have
inherited characteristics that are more suitable for detecting
divergent internal repeats in protein sequences.
The units of these repeats generally have
low identities and ambiguous boundaries, but share
evolutionarily conserved sites or motifs, which are
presumed to have crucial functions. As such, the
accurate definition of repeat length and repeat
number according to substantial biological significance
is a sophisticated problem. This is especially
true for detecting repeat patterns without prior
knowledge, also called 'de novo' repeat detection.
On the other hand, the algorithms depending on
prior knowledge, such as REP, COACH and
TPRpred [42–44], generally search repeat patterns
from sequence databases by profiles constructed
with known repeat families using hidden Markov
models (HMMs) [68]. Therefore, the repeat patterns
identified by these programs are usually well-known,
and some of them are experimentally studied functional
protein domain repeats.

It is generally believed that detecting repeat patterns
with a self-alignment-based method is a feasible
strategy. However, it also has some flaws and limitations.
First, the computational complexity of performing
self-alignment is high. The general
complexity for a sequence with n amino acids is
O(n^2) for both time and space, so the cost grows
quadratically with the sequence
length. Fortunately, this problem is not too serious
for protein sequences, as their average length is
around 320 AAs [69], and the computational capacity
of current computer hardware is powerful
enough to handle this problem within acceptable
time and space. In addition, several optimization strategies
have recently been applied to sequence
alignments. Examples are the implementation of the
Smith-Waterman algorithm on graphics processing
units (GPUs) [70] and the
parallel computing version of the REPRO [71] algorithm
[72], which can handle much longer sequences
within a reasonable time.

One of the main purposes for detecting AARs is
to find novel repeat patterns and infer their functional
and evolutionary roles. As the majority of
repeat patterns in protein sequences have not been
well studied, de novo repeat-detection algorithms are
more widely used, such as REPRO, Internal Repeat
Finder, RADAR, TRUST, HHrep and HHrepID
[45–50, 56, 57]. All of them identify repeats using
the self-comparison strategy, but differ in some aspects.
For example, Internal Repeat Finder assumes
that the statistically significant sub-optimal alignment
scores should have a Poisson distribution [47].
TRUST exploits the transitivity of sub-optimal
alignments, which increases the chance and
reliability of identifying divergent repeats [46]. HHrep
[48] and its optimized version HHrepID [50] compare
a sequence with itself by the HMM-HMM
[73] strategy, which looks for the sub-optimal alignments
using a profile HMM constructed by iterations
of PSI-BLAST [74].

The pattern recognition strategy 

The second strategy to detect AARs from protein 
sequences uses the conventional method of pattern 
recognition. The two main algorithms of this strat- 
egy are the discrete Fourier transform (DFT) and 
neural networks. 

DFT has been widely applied in the research area 
of signal processing. Generally, it can decompose 
signals into constituent frequencies, so that the cryp- 
tic patterns hidden in the signals could be analysed 
intuitively. Early studies showed that DFT can be 
used to detect periodic patterns in collagen protein 
[75], but also has some fundamental difficulties 
which limit its usage [45]. The accuracy of DFT- 
based methods is easily biased by the length variation 
of the repeat units caused by mutations or INDELs, 
as this will weaken the periodical pattern of the 
transformed Fourier spectral amplitudes. 
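As an illustration of how a DFT exposes a periodic pattern, the sketch below turns a sequence into a 0/1 indicator signal for one residue type and reports the period with the largest spectral power. The function name `dominant_period`, the indicator encoding and the naive O(n^2) transform are assumptions chosen to keep the example self-contained; they are not the procedure of any published tool.

```python
import cmath


def dominant_period(seq, residue):
    """Return the period (in residues) with the strongest DFT power
    for the 0/1 indicator signal of `residue` along `seq`."""
    n = len(seq)
    signal = [1.0 if aa == residue else 0.0 for aa in seq]
    mean = sum(signal) / n
    signal = [x - mean for x in signal]      # remove the constant offset
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):           # skip k = 0 (constant term)
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None


# Ten collagen-like GPP units: glycine recurs every third residue.
print(dominant_period("GPP" * 10, "G"))  # 3.0
```

Inserting or deleting a residue in the middle of such a sequence shifts the phase of all downstream units, which spreads the spectral power over neighbouring frequencies; this is the weakness described in the paragraph above.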

Some recent algorithms make efforts to provide
better discrimination on Fourier spectral amplitudes
using newly developed methods. For example,
REPETITA yields better accuracy than self-alignment
methods in detecting solenoid repeats
by introducing several optimized strategies of the
DFT-based method [51]. In addition, the stationary
wavelet packet transform has been widely used in
bioinformatics and computational biology in recent
years [76]. Regarded as a state-of-the-art optimization
of the DFT approach [77], it has been shown to perform
well in detecting protein repeat patterns [78].

The neural network-based method is another
well-studied pattern-recognition strategy, which is
also capable of identifying similar patterns in protein
sequences [79]. A well-established neural network is
able to associate homologous patterns in the protein
sequence with the input patterns and can be trained
to adapt to the patterns. Several neural network algorithms
show good accuracy and time efficiency in
protein homologue detection. LSTM (long short-term
memory) is able to combine amino acid properties
with patterns and does
not rely on pre-defined scoring matrices for similarity
measurements [52]. The ARD neural network is designed
to identify specific alpha-rod repeat patterns
and has been applied to the analysis of Huntingtin
protein sequences [53].

The complexity measurement strategy 

The third approach of identifying AARs takes com- 
plexity measurement into consideration. LCRs are 
widely distributed in protein sequences. LCRs com- 
monly contain particular repeat patterns that have 
continuous repetitions of very short units, such as 
the SAARs and cryptically simple sequences [19]. 
Apparently, these repeats have special functional 
and evolutionary properties that differ from the re- 
peats with more complex patterns and longer units. 
Their typical short unit length makes both the 
self-comparison- and the pattern recognition-based 
strategies less well suited to identify LCR repeats 
efficiently. 

Fortunately, several algorithms have been introduced
to detect repeats involved in LCRs, most of
them using a strategy to measure the complexity of
sequences within a sliding window. As for complexity
measuring, SIMPLE [19] awards a simplicity score
to the central amino acid of each window, and is
most suitable for detecting short unit cryptic repeats.
SEG [80], DSR [81] and CARD [82] are based on
Shannon entropy [83], which displays several limitations
when decoding complex protein sequences
(43).
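The sliding-window idea can be sketched with Shannon entropy as the complexity measure. This is an illustration of the general principle only, not a re-implementation of SEG, DSR or CARD; the function names and the default window size and threshold are assumptions picked for the example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition."""
    n = len(window)
    counts = Counter(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, size=12, threshold=2.2):
    """Return 0-based start positions of windows of length `size`
    whose composition entropy is at or below `threshold` bits."""
    return [i for i in range(len(seq) - size + 1)
            if window_entropy(seq[i:i + size]) <= threshold]


print(window_entropy("ACDE"))   # 2.0 (four equally frequent residues)
# A diverse 12-mer followed by a polyQ run; only windows made up
# almost entirely of Q fall below 1 bit.
print(low_complexity_windows("ACDEFGHIKLMN" + "Q" * 12, threshold=1.0))
```

The fixed `size` parameter makes the limitation of this family of methods visible: a repeat much shorter than the window is diluted by its flanking residues, and one much longer is reported only as a series of overlapping hits.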

The main drawback of sliding windows-based al- 
gorithms is that they all require a pre-specified 
window size, and repeats that are longer or shorter 
than the window are not detectable. On the other 
hand, non-sliding window algorithms show more 
flexibility on detecting repeats in LCRs. GBA [54] 
constructs a graph for each protein sequence, and 
finds short subsequences as LCR candidates through 
traversing. Coronado [84] introduces the composi- 
tion-modified scoring matrices to identify LCRs 
within cell wall proteins of fungi. These algorithms 
are an important complement to the sliding 
window-based algorithms. 

Other strategies 

As described above, the self-comparison strategy and
the pattern recognition strategy are mostly suitable
for detecting divergent repeats, whereas the complexity
measurement strategy is mostly suitable for
detecting simple unit repeats. In addition, dedicated
and optimized strategies for sequential tandem repeats
are also particularly useful. The tandem repeat
patterns in such amino acid fragments are
comparatively more explicit than those of divergent
repeats. They are widespread
in many proteomes across wide taxonomic ranges,
but are still insufficiently studied.

Hamming distance [85] and edit distance, also
called Levenshtein distance [86], are widely used
for measuring the similarity of sequential tandem
repeats [87–90]. Differing from Hamming distance,
which only accounts for point mutations, edit
distance-measuring algorithms also consider insertions
and deletions. In addition, Apriod [56] and
LocRepeat [57] focus on finding the 'pseudo-periodic
partitions', which are gradually evolved patterns
among repeat units. Given that NPTRs are
originally evolved from PTRs, XSTREAM [55] and
REPfind [58] detect NPTRs based on the extension
of exact repeat seeds, which can decrease the
computational complexity of both time and space.
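The difference between the two distance measures can be shown with a short sketch. Both functions are textbook formulations written for this illustration, and the mutated octamers are made-up examples rather than observed sequences.

```python
def hamming(a, b):
    """Number of mismatching positions; defined for equal lengths only."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length units")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute
        prev = curr
    return prev[-1]


unit = "PHGGGWGQ"
print(hamming(unit, "PHGGSWGQ"))        # 1 (one point mutation)
print(edit_distance(unit, "PHGGWGQ"))   # 1 (one deleted glycine)
```

The second pair cannot be compared by Hamming distance at all because the units differ in length, which is why INDEL-tolerant detection relies on edit distance.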

Most of the repeat-detection algorithms can identify
PTRs together with other repeat patterns incidentally.
But as some of the PTRs are nested in
larger NPTR fragments, which can hardly be distinguished
by the common strategies, a dedicated
algorithm for detecting PTRs is also necessary. For
example, the suffix tree-based strategy can
identify all PTRs in a protein sequence with linear
time complexity [60]. Reptile uses a 'brute-force'
strategy to detect PTRs from the proteins of parasite
antigens [59]. Following the definition of statistically
significant repeat runs in protein sequences [91],
minimum copy numbers of five, four, three and two
are common cut-offs for identifying
mono-amino, di-amino, tri-amino and all longer-unit
repeats, respectively.
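The cut-off rule can be sketched as a simple brute-force scan for perfect tandem repeats. This is an illustrative sketch only and is not the Reptile or suffix-tree algorithm; the function names, the `max_unit` limit and the decision to skip units that are themselves repeats of a shorter unit are assumptions.

```python
def min_copies(unit_len):
    """Minimum copy number: 5, 4, 3 for unit lengths 1, 2, 3; else 2."""
    return {1: 5, 2: 4, 3: 3}.get(unit_len, 2)


def is_primitive(unit):
    """True if unit is not itself a repetition of a shorter string."""
    return (unit + unit).find(unit, 1) == len(unit)


def find_perfect_tandem_repeats(seq, max_unit=10):
    """Return (start, unit, copies) tuples, with 0-based start positions."""
    hits = []
    n = len(seq)
    for unit_len in range(1, max_unit + 1):
        i = 0
        while i + unit_len <= n:
            unit = seq[i:i + unit_len]
            copies = 1
            # Count how many further copies of the unit follow directly.
            while seq[i + copies * unit_len:
                      i + (copies + 1) * unit_len] == unit:
                copies += 1
            if copies >= min_copies(unit_len) and is_primitive(unit):
                hits.append((i, unit, copies))
                i += copies * unit_len      # jump past the repeat
            else:
                i += 1
    return hits


print(find_perfect_tandem_repeats("MAQQQQQQLPQPQPQPQK"))
# [(2, 'Q', 6), (9, 'PQ', 4)]
```

Without the `is_primitive` check the PQ run would be reported a second time as two copies of PQPQ, because the cut-off for units of four or more residues is only two copies.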



SUMMARY AND PERSPECTIVE 

Identifying repeat patterns in proteins is the first step
towards the understanding of their physiological
function and evolutionary mechanism. During the
evolution process, these patterns become so intricate
that no single algorithm is adequate to identify all of
them. There is no doubt that an in-depth investigation
of their biological background is required to
choose proper algorithms for the identification of
specific patterns. In general, self-comparison algorithms
are suitable for detecting de novo repeats with complex
patterns. Pattern recognition-based algorithms
are suitable for detecting repeats with low sequence
identities but high intrinsic biological similarities.
Complexity measurement-based algorithms can be
applied to detect repeats with simple patterns
involved in LCRs. For the tandem repeats that
have more sequentially repetitive patterns, one
should consider the strategies that measure the
similarity of repeat units by edit or Hamming
distance.

Key Points

• Amino acid repeats are abundant in protein sequences.

• They can be classified into different categories depending on the
characteristics of the repeat units.

• Different amino acid repeat patterns imply different functional
and evolutionary backgrounds.

• The three major approaches for detection of amino acid repeats
are the self-comparison strategy, the pattern recognition strategy
and the complexity measurement strategy.

The biological significance of protein repeats has
been discussed for years. Internal duplication in genomes
is one of the most important evolutionary
mechanisms by which species adapt to the environment
[92–94]. As a result, repetitive patterns at the DNA
level such as interspersed microsatellites and tandem
tri-nucleotide repeats are prevalent. Intragenic
repeats are presumed to have potential roles in generating
functional variability [95, 96], and the repeats
in coding regions corresponding to AARs are
more likely to go through adaptive competition [24,
97, 98]. Therefore, the large number of repeats in proteins
is unlikely to be mere 'junk proteins'
[99] with only non-essential roles. At the
same time, their variable characteristics and their
vulnerability to disorders and diseases have been a scientific
puzzle for a long time. Frequently asked questions
are: Are the characteristics of similar repeat patterns
coherent in different proteomes across different
kingdoms of life? Could the functional and evolutionary
roles of certain repeats correspond to their particular
characteristics, such as position bias, GC content
constraints and codon usage? How could the conserved
functions of particular repeats have evolved by
selection? And what are the structure- and
sequence-based strategies to prevent repeats from
aggregation?

The insufficient understanding of protein repeats is
not only due to the difficulty of identification, but
also because of the lack of an integrated repository for
large-scale investigation and comparison of repeats
among a variety of proteomes across different kingdoms.
To that end, we developed ProRepeat
(http://prorepeat.bioinformatics.nl), which integrates
non-redundant tandem repeats detected by
several algorithms from the UniProt [69] and
RefSeq [100] protein databases and offers powerful
analysis tools for finding biologically interesting
properties of query results. In addition, we
integrated ProRepeat with ProGMap, a tool we
developed for the integration of annotation resources
for protein orthology [101]. With this set-up, we
will be making large-scale orthologous comparisons
of protein repeats over a broad taxonomic range,
especially eukaryotes, in the near future.